Here we show the code to reproduce the analyses of: Risso and Pagnotta (2019). Within-sample standardization and asymmetric winsorization lead to accurate classification of RNA-seq expression profiles. In preparation.

This file belongs to the repository: https://github.com/drisso/awst_analysis.

The code is released under the GPL v3.0 license.

Install and load awst

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("drisso/awst")
library(awst)

Data import and cleaning

The collection of SEQC datasets is available through the seqc Bioconductor package, which can be installed as follows.

BiocManager::install("seqc")

We next build the data matrix from the “ILM_aceview” experiments. We remove duplicate gene symbols, ERCC spike-ins, and genes with no ENTREZ ID.
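
The cleaning steps above can be sketched on a toy annotation table. The column names `Symbol`, `EntrezID`, and `IsERCC` are assumptions modeled on the layout of the seqc gene tables, the values are made up, and keeping the first row of each duplicated symbol is only one possible reading of "remove duplicate gene symbols"; the repository code may differ.

```r
# Toy gene table; column names are assumptions, values are made up.
genes <- data.frame(
  Symbol   = c("A1BG", "A1BG", "ERCC-00002", "TP53", "LOC1"),
  EntrezID = c(1, 1, NA, 7157, NA),
  IsERCC   = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  s1       = c(10, 12, 500, 30, 4),
  s2       = c(8, 9, 450, 25, 6),
  stringsAsFactors = FALSE
)

keep <- !genes$IsERCC &          # drop ERCC spike-ins
  !is.na(genes$EntrezID) &       # drop genes with no ENTREZ ID
  !duplicated(genes$Symbol)      # keep the first row of each gene symbol

clean <- genes[keep, ]

# Expression matrix: genes in rows, samples in columns.
x <- as.matrix(clean[, c("s1", "s2")])
rownames(x) <- clean$Symbol
```

In this toy table only `A1BG` (first occurrence) and `TP53` survive the three filters.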

Distribution of samples per site
     AGR  BGI  CNL  COH  MAY  NVS  Sum
A      4    5    5    4    5    4   27
B      4    5    5    4    5    4   27
C      4    5    5    4    5    4   27
D      4    5    5    4    5    4  108
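
A cross-tabulation like the one above can be produced with `table()` and `addmargins()`. The sketch below rebuilds the design from the reported counts rather than from the seqc data; the `design` data frame and its column names are hypothetical.

```r
# Replicates per site, as reported in the table above.
reps <- c(AGR = 4, BGI = 5, CNL = 5, COH = 4, MAY = 5, NVS = 4)

# One row per library: every sample type (A-D) is sequenced at every site.
site <- rep(names(reps), times = reps)
design <- data.frame(
  type = rep(c("A", "B", "C", "D"), each = length(site)),
  site = rep(site, times = 4)
)

# Cross-tabulate sample type by site and append row and column totals.
tab <- addmargins(table(design$type, design$site))
print(tab)
```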

Figure 2 and Supplementary figure 1

Figure 2a and Supplementary figure 1a: clustering on RSEM data after awst
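
As a minimal sketch of the clustering step, the code below runs hierarchical clustering on a toy matrix that stands in for transformed data, with samples in rows. The Euclidean distance and Ward linkage are assumptions, not necessarily the settings used for the figures, and the toy values are simulated.

```r
set.seed(1)
# Toy stand-in for a transformed matrix: 8 samples (rows) x 50 genes (columns),
# two groups of 4 samples with shifted means.
x <- rbind(
  matrix(rnorm(4 * 50, mean = 0), nrow = 4),
  matrix(rnorm(4 * 50, mean = 3), nrow = 4)
)
rownames(x) <- paste0(rep(c("A", "B"), each = 4), 1:4)

d  <- dist(x)                         # Euclidean distance between samples
hc <- hclust(d, method = "ward.D2")   # Ward linkage (an assumption here)
cl <- cutree(hc, k = 2)               # cut the tree into two clusters

# Compare the partition with the known groups.
print(table(cluster = cl, group = substr(names(cl), 1, 1)))
```

The dendrogram itself can be drawn with `plot(hc)`.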

Figure 2b and Supplementary figure 1b: clustering on CPM data after awst

Figure 2c: clustering on CPM data (std values)
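
For orientation, counts per million and gene-wise standardization can be computed as in the sketch below. It assumes that "std values" refers to log-CPM centered and scaled per gene, which the text does not state explicitly; the count matrix is made up.

```r
# Toy count matrix: genes in rows, samples in columns (values are made up).
counts <- matrix(c(10,  0, 90,
                   20,  5, 75,
                    5, 15, 80,
                   40, 10, 50), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:4)))

# Counts per million: divide each column by its library size.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Log-transform, then center and scale each gene across samples.
logcpm <- log2(cpm + 1)
z <- t(scale(t(logcpm)))
```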

Figure 2d and Supplementary figure 1d: clustering on CPM data (std values; top 100 genes)
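
Restricting the analysis to the top genes can be sketched as follows. Ranking genes by their variance across samples is an assumption, since the selection criterion is not given here, and the matrix is simulated.

```r
set.seed(42)
# Toy log-expression matrix: 500 genes (rows) x 12 samples (columns).
logx <- matrix(rnorm(500 * 12), nrow = 500,
               dimnames = list(paste0("g", 1:500), paste0("s", 1:12)))

n_top <- 100                                 # number of genes to retain
v <- apply(logx, 1, var)                     # per-gene variance across samples
top <- order(v, decreasing = TRUE)[seq_len(n_top)]

# Standardize the selected genes, then cluster the samples.
z  <- t(scale(t(logx[top, ])))
hc <- hclust(dist(t(z)), method = "ward.D2")
```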

Figure 2e: clustering on TPM data (std values)
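
As a reminder of how TPM relates to raw counts, the sketch below converts a toy count matrix using gene lengths. The counts and lengths are made up, and quantification tools such as RSEM typically report TPM directly, so this step may not be needed in practice.

```r
# Toy counts: genes in rows, samples in columns (values are made up).
counts <- matrix(c(100, 200, 300,
                    50, 400, 150), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)   # base pairs (made up)

# Reads per kilobase: each row is divided by its gene length in kb.
rpk <- counts / (gene_length / 1000)

# Rescale each sample so that its values sum to one million.
tpm <- sweep(rpk, 2, colSums(rpk), "/") * 1e6
```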

Figure 2f and Supplementary figure 1f: clustering on TPM data (std values; top 100 genes)

Figure 2g and Supplementary figure 1g: clustering on TPM data after awst

Figure 2h and Supplementary figure 1h: clustering on TPM data after Hart transformation
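
The Hart transformation is commonly described as a per-sample standardization of log2 expression against a Gaussian fitted to the main mode of the distribution. The sketch below follows that description under stated assumptions: the helper `hart_z` is hypothetical, the mode is taken from a default kernel density estimate, and zeros are left as `NA`. It is not necessarily the implementation used for the figure.

```r
# Standardize one sample's expression values (e.g., TPM) on the log2 scale.
hart_z <- function(x) {
  lx <- log2(x[x > 0])                  # zeros are excluded from the fit
  d  <- density(lx)                     # kernel density estimate
  mu <- d$x[which.max(d$y)]             # mode taken as the Gaussian mean
  # Standard deviation estimated from the values above the mode
  # (half-normal relation: E[X - mu | X > mu] = sd * sqrt(2 / pi)).
  s  <- (mean(lx[lx > mu]) - mu) * sqrt(pi / 2)
  z  <- rep(NA_real_, length(x))
  z[x > 0] <- (lx - mu) / s
  z
}

set.seed(7)
# Toy sample: log2 expression roughly Gaussian around 4, plus some zeros.
tpm <- c(2^rnorm(2000, mean = 4, sd = 2), rep(0, 50))
z <- hart_z(tpm)
```

Applied column by column, for example with `apply(tpm_matrix, 2, hart_z)` on a hypothetical genes-by-samples matrix, this would give one standardized profile per sample.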

Figure 3 and Supplementary figure 2

3 perturbed samples

Figure 3a and Supplementary figure 2a: clustering on CPM data (std values; top 2500 genes)

Figure 3e and Supplementary figure 2e: clustering on RSEM data after awst

Two-thirds perturbed samples

Figure 3c and Supplementary figure 2c: clustering on CPM data (std values; top 2500 genes)

Figure 3g: clustering on RSEM data after awst

All perturbed samples

Figure 3d and Supplementary figure 2d: clustering on CPM data (std values; top 2500 genes)

Figure 3h and Supplementary figure 2h: clustering on RSEM data after awst

Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] awst_0.0.3        dendextend_1.12.0 cluster_2.1.0     knitr_1.25       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2        pillar_1.4.2      compiler_3.6.1   
##  [4] highr_0.8         viridis_0.5.1     tools_3.6.1      
##  [7] digest_0.6.21     evaluate_0.14     tibble_2.1.3     
## [10] gtable_0.3.0      viridisLite_0.3.0 pkgconfig_2.0.3  
## [13] rlang_0.4.0       yaml_2.2.0        xfun_0.10        
## [16] gridExtra_2.3     stringr_1.4.0     dplyr_0.8.3      
## [19] grid_3.6.1        tidyselect_0.2.5  glue_1.3.1       
## [22] R6_2.4.0          rmarkdown_1.16    ggplot2_3.2.1    
## [25] purrr_0.3.2       magrittr_1.5      scales_1.0.0     
## [28] htmltools_0.4.0   assertthat_0.2.1  colorspace_1.4-1 
## [31] stringi_1.4.3     lazyeval_0.2.2    munsell_0.5.0    
## [34] crayon_1.3.4